There are a lot of different ways to work with R. Here, we are going to focus on a few key things that will be helpful for your homeworks and empirical project
You can learn more about any R command by typing help(commandname). Sometimes these help files can be pretty complicated, so don’t be afraid to consult me, the internet, or the resources below if you get stuck
For more detailed introductions to R’s built-in capabilities, see An introduction to R and FasteR: The fast lane to learning R
For more on the Tidyverse packages (and a few other things), see R for Data Science
You can download R from the R project website (Click on Download > CRAN, choose a nearby mirror, and choose the version for your operating system)
RStudio is a program that makes it easier to work in R. RStudio is free for non-commercial use, and can be downloaded from the Posit website
You can interact with R by typing commands directly into R’s “console window,” but this isn’t recommended
Instead, you should create an R Script (a file that ends with .R) in RStudio, that contains a list of commands that you want to run. This will give you a reproducible record of everything you did
After you’ve typed a command, you can run it by hitting Command + Enter (on macOS) or Ctrl + Enter (on Windows or Linux), or by highlighting it, then clicking on Run (clicking Run without highlighting anything runs the line where the cursor is; click Source to run the whole file)
R can do simple calculations, like
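The original command is not preserved here; a minimal example, assuming the calculation was a simple sum:

```r
# Basic arithmetic, typed at the console or run from a script
2 + 2
```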
You can store an object in the “workspace” using “<-” (the “assignment operator”):
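A short sketch of assignment, where the object name x is just an illustrative choice:

```r
# Store the result of a calculation under the (hypothetical) name x
x <- 2 + 2

# Typing the name of an object prints its value
x
```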
It’s more common to work with vectors, which are lists of numbers or characters
You can create them using c() (the “concatenate operator”):
[1] 1 5 10
[1] 0.2 1.0 2.0
[1] "Hello" "," "how" "are" "you" "?"
(Note that on the left, “c” is the name of the vector, while on the right it is a function. This is ok)
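The commands behind the output above are not preserved. A sketch that is consistent with it, assuming the first two vectors were called a and b:

```r
a <- c(1, 5, 10)   # a numeric vector
a

b <- a / 5         # arithmetic is applied to each element in turn
b

# A character vector named c; the function c() still works afterwards
c <- c("Hello", ",", "how", "are", "you", "?")
c
```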
You can save your workspace using
And you can open an existing one using
setwd stands for “set working directory.” You need to customize this to the location of your file
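A sketch of saving and reopening a workspace, assuming the base R functions save.image and load were the ones used. The folder and file name are placeholders:

```r
# Placeholder location: replace tempdir() with the folder that holds your files
setwd(tempdir())

x <- c(1, 5, 10)

# Write every object in the workspace to a file in the working directory
save.image("my_workspace.RData")

# Read the saved objects back in (for example, in a later session)
load("my_workspace.RData")
x
```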
You can also do all of this from the Session menu in RStudio
Packages are external commands that extend R’s capabilities
You can install them using
and load them (so that R can use them) using
Note that you use quotes in the first case but not the second
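For example, using the tidyverse package (the install line is commented out here because it only needs to be run once per computer):

```r
# Install once (package name in quotes); uncomment to run
# install.packages("tidyverse")

# Load at the start of every R session (no quotes needed)
library(tidyverse)
```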
R can open lots of different kinds of data. We’ll focus on CSV (comma separated value) and Excel, since these are two of the most common formats
The functions for reading both formats (read_csv from the readr package and read_excel from the readxl package) come with the Tidyverse set of packages
Our dataset contains information on state-level murder rates
We can import the CSV file using
This saves the data as a dataframe, which is basically just a list of vectors (the variables in the dataset, similar to a spreadsheet)
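A sketch of the CSV import. The location of the course file is not given here, so this example first writes a tiny stand-in file (a few values copied from the output in this section) and then reads it back; the names csv_path and murder_demo are hypothetical:

```r
library(readr)   # part of the Tidyverse

# Write a small stand-in CSV so the example runs by itself.
# In practice, point read_csv() at your own file instead.
csv_path <- file.path(tempdir(), "murder_demo.csv")
writeLines(c("state,year,mrdrte,exec",
             "AL,87,9.3,2",
             "AL,90,11.6,5",
             "AK,87,10.1,0"),
           csv_path)

# read_csv() returns a tibble, which is a kind of dataframe
murder_demo <- read_csv(csv_path)
murder_demo
```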
If the data are saved in Excel format, we can import them using
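A sketch of the Excel import. The file name murder.xlsx is a placeholder, so the call is wrapped in a check that the file exists:

```r
library(readxl)   # installed with the Tidyverse, but loaded separately

# Placeholder file name: replace with the path to your own Excel file
xlsx_path <- "murder.xlsx"

if (file.exists(xlsx_path)) {
  murder <- read_excel(xlsx_path)   # reads the first sheet by default
}
```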
Once we’ve imported the dataset, we can take a quick look at it using
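The command is presumably head. A sketch using a hand-typed stand-in called murder_demo (a few columns of the first six rows, as printed in this section); the output shown next comes from the full dataset:

```r
# Stand-in data: values copied from the rows printed in this section
murder_demo <- data.frame(
  state  = c("AL", "AL", "AL", "AK", "AK", "AK"),
  year   = c(87, 90, 93, 87, 90, 93),
  mrdrte = c(9.3, 11.6, 11.6, 10.1, 7.5, 9.0),
  exec   = c(2, 5, 2, 0, 0, 0)
)

head(murder_demo)      # first six rows by default
head(murder_demo, 3)   # or ask for a specific number of rows
```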
# A tibble: 6 × 13
id state year mrdrte exec unem d90 d93 cmrdrte cexec cunem cexec_1
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 AL 87 9.30 2 7.80 0 0 NA NA NA NA
2 1 AL 90 11.6 5 6.80 1 0 2.30 3 -1 NA
3 1 AL 93 11.6 2 7.5 0 1 0 -3 0.700 3
4 2 AK 87 10.1 0 10.8 0 0 NA NA NA NA
5 2 AK 90 7.5 0 6.90 1 0 -2.60 0 -3.90 NA
6 2 AK 93 9 0 7.60 0 1 1.5 0 0.700 0
# ℹ 1 more variable: cunem_1 <dbl>
You can view the entire dataset in RStudio by typing View(murder)
You can also get a list of all of the variables using str (which stands for “structure”):
tibble [153 × 13] (S3: tbl_df/tbl/data.frame)
$ id : num [1:153] 1 1 1 2 2 2 3 3 3 4 ...
$ state : chr [1:153] "AL" "AL" "AL" "AK" ...
$ year : num [1:153] 87 90 93 87 90 93 87 90 93 87 ...
$ mrdrte : num [1:153] 9.3 11.6 11.6 10.1 7.5 ...
$ exec : num [1:153] 2 5 2 0 0 0 0 0 3 0 ...
$ unem : num [1:153] 7.8 6.8 7.5 10.8 6.9 ...
$ d90 : num [1:153] 0 1 0 0 1 0 0 1 0 0 ...
$ d93 : num [1:153] 0 0 1 0 0 1 0 0 1 0 ...
$ cmrdrte: num [1:153] NA 2.3 0 NA -2.6 ...
$ cexec : num [1:153] NA 3 -3 NA 0 0 NA 0 3 NA ...
$ cunem : num [1:153] NA -1 0.7 NA -3.9 ...
$ cexec_1: num [1:153] NA NA 3 NA NA 0 NA NA 0 NA ...
$ cunem_1: num [1:153] NA NA -1 NA NA ...
For more complex graphs, I recommend using the ggplot2 package, which is part of the Tidyverse
We can look at the distribution of murder rates using:
aes(x = mrdrte) (the “aesthetic”) tells R which variable(s) we’re graphing
geom_histogram (the “geometric object”) tells R what kind of graph we want
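A sketch of the histogram command, using a small hand-typed stand-in (murder_demo) so that it runs by itself. The bins = 5 setting is an arbitrary choice for such a small sample:

```r
library(ggplot2)

# Stand-in data: murder rates copied from the rows printed in this section
murder_demo <- data.frame(
  mrdrte = c(9.3, 11.6, 11.6, 10.1, 7.5, 9.0)
)

# aes() picks the variable; geom_histogram() picks the kind of graph
p_hist <- ggplot(murder_demo, aes(x = mrdrte)) +
  geom_histogram(bins = 5)
p_hist
```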
We could look at the relationship between murder rates and executions using:
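A sketch of a scatter plot, assuming executions go on the horizontal axis and murder rates on the vertical axis (murder_demo is a hand-typed stand-in):

```r
library(ggplot2)

# Stand-in data: values copied from the rows printed in this section
murder_demo <- data.frame(
  mrdrte = c(9.3, 11.6, 11.6, 10.1, 7.5, 9.0),
  exec   = c(2, 5, 2, 0, 0, 0)
)

# geom_point() draws one point per observation
p_scatter <- ggplot(murder_demo, aes(x = exec, y = mrdrte)) +
  geom_point()
p_scatter
```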
We can use additional “geoms” to add more to the graph. For example, to add a regression line, we can use:
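One way to do this is geom_smooth with method = "lm", which overlays a fitted straight line. A sketch on the same kind of stand-in data:

```r
library(ggplot2)

# Stand-in data: values copied from the rows printed in this section
murder_demo <- data.frame(
  mrdrte = c(9.3, 11.6, 11.6, 10.1, 7.5, 9.0),
  exec   = c(2, 5, 2, 0, 0, 0)
)

# Each geom is added with +; geom_smooth(method = "lm") adds a regression line
p_fit <- ggplot(murder_demo, aes(x = exec, y = mrdrte)) +
  geom_point() +
  geom_smooth(method = "lm")
p_fit
```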
If we wanted a different graph for each year, we could add facet_wrap (the scales = "free" option tells R to use different axes for each year):
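A sketch of the faceted version. The regression line is left out here only because the stand-in data have two points per year:

```r
library(ggplot2)

# Stand-in data: values copied from the rows printed in this section
murder_demo <- data.frame(
  year   = c(87, 90, 93, 87, 90, 93),
  mrdrte = c(9.3, 11.6, 11.6, 10.1, 7.5, 9.0),
  exec   = c(2, 5, 2, 0, 0, 0)
)

# facet_wrap(~ year) draws one panel per year;
# scales = "free" lets each panel have its own axes
p_facet <- ggplot(murder_demo, aes(x = exec, y = mrdrte)) +
  geom_point() +
  facet_wrap(~ year, scales = "free")
p_facet
```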
You can obtain very simple summary statistics using summary:
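A sketch of the call on a hand-typed stand-in (murder_demo); the output shown next comes from the full dataset:

```r
# Stand-in data: values copied from the rows printed in this section
murder_demo <- data.frame(
  state  = c("AL", "AL", "AL", "AK", "AK", "AK"),
  mrdrte = c(9.3, 11.6, 11.6, 10.1, 7.5, 9.0),
  exec   = c(2, 5, 2, 0, 0, 0)
)

# One block of statistics per column
s <- summary(murder_demo)
s
```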
id state year mrdrte exec
Min. : 1 Length:153 Min. :87 Min. : 0.800 Min. : 0.000
1st Qu.:13 Class :character 1st Qu.:87 1st Qu.: 3.900 1st Qu.: 0.000
Median :26 Mode :character Median :90 Median : 6.400 Median : 0.000
Mean :26 Mean :90 Mean : 8.071 Mean : 1.229
3rd Qu.:39 3rd Qu.:93 3rd Qu.:10.200 3rd Qu.: 1.000
Max. :51 Max. :93 Max. :78.500 Max. :34.000
unem d90 d93 cmrdrte
Min. : 2.200 Min. :0.0000 Min. :0.0000 Min. :-2.6000
1st Qu.: 4.900 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:-0.4000
Median : 5.800 Median :0.0000 Median :0.0000 Median : 0.3000
Mean : 5.973 Mean :0.3333 Mean :0.3333 Mean : 0.8422
3rd Qu.: 7.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 1.3000
Max. :12.000 Max. :1.0000 Max. :1.0000 Max. :41.6000
NA's :51
cexec cunem cexec_1 cunem_1
Min. :-11.0000 Min. :-5.800000 Min. :-11.0000 Min. :-5.8000
1st Qu.: 0.0000 1st Qu.:-1.075000 1st Qu.: 0.0000 1st Qu.:-1.9500
Median : 0.0000 Median : 0.300000 Median : 0.0000 Median :-1.0000
Mean : 0.1863 Mean : 0.005882 Mean : -0.2745 Mean :-0.8863
3rd Qu.: 0.0000 3rd Qu.: 1.000000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
Max. : 23.0000 Max. : 3.600000 Max. : 5.0000 Max. : 3.1000
NA's :51 NA's :51 NA's :102 NA's :102
Another way to get quick summary statistics is using the describe function from the psych package (the :: notation is a way to use a package without loading it):
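A sketch of the call, assuming the psych package is installed. The stand-in data (murder_demo) keep only numeric columns; the output shown next comes from the full dataset:

```r
# Stand-in data: values copied from the rows printed in this section
murder_demo <- data.frame(
  mrdrte = c(9.3, 11.6, 11.6, 10.1, 7.5, 9.0),
  exec   = c(2, 5, 2, 0, 0, 0),
  unem   = c(7.8, 6.8, 7.5, 10.8, 6.9, 7.6)
)

# psych::describe() calls describe from psych without library(psych)
desc <- psych::describe(murder_demo)
desc
```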
vars n mean sd median trimmed mad min max range skew
id 1 153 26.00 14.77 26.0 26.00 19.27 1.0 51.0 50.0 0.00
state* 2 153 26.00 14.77 26.0 26.00 19.27 1.0 51.0 50.0 0.00
year 3 153 90.00 2.46 90.0 90.00 4.45 87.0 93.0 6.0 0.00
mrdrte 4 153 8.07 9.19 6.4 6.88 4.45 0.8 78.5 77.7 5.95
exec 5 153 1.23 3.79 0.0 0.35 0.00 0.0 34.0 34.0 5.72
unem 6 153 5.97 1.68 5.8 5.89 1.48 2.2 12.0 9.8 0.65
d90 7 153 0.33 0.47 0.0 0.29 0.00 0.0 1.0 1.0 0.70
d93 8 153 0.33 0.47 0.0 0.29 0.00 0.0 1.0 1.0 0.70
cmrdrte 9 102 0.84 4.29 0.3 0.41 1.33 -2.6 41.6 44.2 8.40
cexec 10 102 0.19 2.95 0.0 0.04 0.00 -11.0 23.0 34.0 4.02
cunem 11 102 0.01 1.66 0.3 0.05 1.41 -5.8 3.6 9.4 -0.47
cexec_1 12 51 -0.27 2.19 0.0 -0.05 0.00 -11.0 5.0 16.0 -2.67
cunem_1 13 51 -0.89 1.73 -1.0 -0.95 1.63 -5.8 3.1 8.9 0.12
kurtosis se
id -1.22 1.19
state* -1.22 1.19
year -1.52 0.20
mrdrte 41.80 0.74
exec 40.36 0.31
unem 1.06 0.14
d90 -1.52 0.04
d93 -1.52 0.04
cmrdrte 76.90 0.42
cexec 35.09 0.29
cunem 0.55 0.16
cexec_1 11.22 0.31
cunem_1 0.47 0.24
You can get customized summaries using the Tidyverse packages
To get the mean and standard deviation of the murder rate and of executions, as well as the number of observations, we can use
murder |> summarize(mean_mrdrte = mean(mrdrte),
                    sd_mrdrte = sd(mrdrte),
                    mean_exec = mean(exec),
                    sd_exec = sd(exec),
                    n = n())
# A tibble: 1 × 5
mean_mrdrte sd_mrdrte mean_exec sd_exec n
<dbl> <dbl> <dbl> <dbl> <int>
1 8.07 9.19 1.23 3.79 153
“|>” is the pipe operator, which passes the dataframe on its left to the function on its right as the first argument. This tells R which dataframe we’re working with (so we don’t need the $ syntax)
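A small sketch of what the pipe does, using a plain vector of murder rates copied from the rows printed earlier (the names x, piped and direct are illustrative):

```r
x <- c(9.3, 11.6, 11.6, 10.1, 7.5, 9.0)

piped  <- x |> mean()   # the pipe passes x as the first argument of mean()
direct <- mean(x)       # so this is the same call written the usual way

identical(piped, direct)
```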
If we only wanted to do this for a certain year, we could combine this with filter:
murder |> filter(year==87) |>
  summarize(mean_mrdrte = mean(mrdrte),
            sd_mrdrte = sd(mrdrte),
            mean_exec = mean(exec),
            sd_exec = sd(exec),
            n = n())
# A tibble: 1 × 5
mean_mrdrte sd_mrdrte mean_exec sd_exec n
<dbl> <dbl> <dbl> <dbl> <int>
1 7.04 5.22 1.20 3.62 51
Note that if our command spans lines, we have to put |> at the end of the line
If we wanted to know these statistics for each year, we could use group_by:
murder |> group_by(year) |>
  summarize(mean_mrdrte = mean(mrdrte),
            sd_mrdrte = sd(mrdrte),
            mean_exec = mean(exec),
            sd_exec = sd(exec),
            n = n())
# A tibble: 3 × 6
year mean_mrdrte sd_mrdrte mean_exec sd_exec n
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 87 7.04 5.22 1.20 3.62 51
2 90 8.44 10.6 0.922 2.18 51
3 93 8.73 10.7 1.57 5.06 51
You can add a new variable to a dataframe using
The $ tells R that the variable is part of the dataframe murder
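The original command is not preserved. A sketch, assuming it created the unem_sq variable that appears in later output, run on a hand-typed stand-in (murder_demo):

```r
# Stand-in data: unemployment rates copied from the rows printed in this section
murder_demo <- data.frame(unem = c(7.8, 6.8, 7.5, 10.8, 6.9, 7.6))

# dataframe$newname <- expression adds a column to the dataframe
murder_demo$unem_sq <- murder_demo$unem^2
murder_demo
```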
If you are going to make a lot of new variables, it can be easier to use mutate from the Tidyverse packages:
The |> syntax is the pipe operator, which tells R which dataframe you are working in
The (year==90) syntax is an indicator that is TRUE if the condition in parentheses holds and FALSE otherwise (R treats TRUE as one and FALSE as zero in calculations)
Note that we use double equals signs whenever evaluating whether a condition is true
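A sketch of the mutate call, assuming it created the exec_sq, year90 and year93 variables that appear in later output (murder_demo is a hand-typed stand-in):

```r
library(dplyr)   # part of the Tidyverse

# Stand-in data: values copied from the rows printed in this section
murder_demo <- data.frame(
  year = c(87, 90, 93, 87, 90, 93),
  exec = c(2, 5, 2, 0, 0, 0)
)

# mutate() can create several variables in one call
murder_demo <- murder_demo |>
  mutate(exec_sq = exec^2,
         year90  = (year == 90),   # TRUE for 1990 observations, FALSE otherwise
         year93  = (year == 93))
murder_demo
```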
We could change the values of the year variable using:
murder <- murder |> mutate(year = case_when(year == 87 ~ 1987,
                                            year == 90 ~ 1990,
                                            year == 93 ~ 1993))
murder |> head()
# A tibble: 6 × 17
id state year mrdrte exec unem d90 d93 cmrdrte cexec cunem cexec_1
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 AL 1987 9.30 2 7.80 0 0 NA NA NA NA
2 1 AL 1990 11.6 5 6.80 1 0 2.30 3 -1 NA
3 1 AL 1993 11.6 2 7.5 0 1 0 -3 0.700 3
4 2 AK 1987 10.1 0 10.8 0 0 NA NA NA NA
5 2 AK 1990 7.5 0 6.90 1 0 -2.60 0 -3.90 NA
6 2 AK 1993 9 0 7.60 0 1 1.5 0 0.700 0
# ℹ 5 more variables: cunem_1 <dbl>, unem_sq <dbl>, exec_sq <dbl>,
# year90 <lgl>, year93 <lgl>
Suppose that you were working with the murder data, but wanted to add state\(\times\)year-level variables from another source. How could you merge these two sources?
First, let me create a fake dataset to merge in:
rnorm(n()) creates a normally distributed variable with the same number of observations as our dataset
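The command that built the fake dataset is not preserved. A sketch, assuming it kept the state and year columns and added a random variable called newvar (set.seed is an addition that makes the draws repeatable):

```r
library(dplyr)   # part of the Tidyverse

# Stand-in data: values copied from the rows printed in this section
murder_demo <- data.frame(
  state  = c("AL", "AL", "AL", "AK", "AK", "AK"),
  year   = c(1987, 1990, 1993, 1987, 1990, 1993),
  mrdrte = c(9.3, 11.6, 11.6, 10.1, 7.5, 9.0)
)

set.seed(1)   # assumption: not necessarily in the original

# Keep the merge keys and add one normally distributed draw per row
fake <- murder_demo |>
  select(state, year) |>
  mutate(newvar = rnorm(n()))
fake
```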
Now we can merge these data using left_join:
murder <- murder |> left_join(fake, join_by(state, year))
murder |> select(state, year, mrdrte, newvar) |> head()
# A tibble: 6 × 4
state year mrdrte newvar
<chr> <dbl> <dbl> <dbl>
1 AL 1987 9.30 0.457
2 AL 1990 11.6 -1.51
3 AL 1993 11.6 0.0975
4 AK 1987 10.1 0.718
5 AK 1990 7.5 0.322
6 AK 1993 9 -0.934
This is called a “left join” because it always keeps the original data, even if it doesn’t get matched to the new dataset (there are other types of joins that R can do, but this is the most common)
We will see examples of other techniques as the course progresses
For example, if we want to “run a regression” of murder rates on execution rates, we can use the “linear model” function lm (we’ll learn what this means later):
This saves the result under model. To view the results, we can use
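A sketch of both steps on a hand-typed stand-in (murder_demo). The formula matches the Call line in the output shown next, which comes from the full dataset:

```r
# Stand-in data: values copied from the rows printed in this section
murder_demo <- data.frame(
  mrdrte = c(9.3, 11.6, 11.6, 10.1, 7.5, 9.0),
  exec   = c(2, 5, 2, 0, 0, 0)
)

# Dependent variable on the left of ~, explanatory variable on the right
model <- lm(mrdrte ~ exec, data = murder_demo)

# Print coefficients, standard errors and fit statistics
summary(model)
```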
Call:
lm(formula = mrdrte ~ exec, data = murder)
Residuals:
Min 1Q Median 3Q Max
-6.966 -3.866 -1.566 1.898 70.734
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.7658 0.7800 9.957 <2e-16 ***
exec 0.2481 0.1963 1.264 0.208
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.175 on 151 degrees of freedom
Multiple R-squared: 0.01047, Adjusted R-squared: 0.003914
F-statistic: 1.597 on 1 and 151 DF, p-value: 0.2082
Quarto allows you to write documents and presentations using R and RStudio
I used Quarto to make these slides. You can use it for your homeworks and empirical presentation (if you want, you don’t have to)
To create a Quarto document, select File > New File > Quarto Document
To create a presentation, select Quarto Presentation instead
You can click Render to turn your file into an HTML, PDF, Word or PowerPoint file
The basic syntax looks like this:
If you prefer, it’s perfectly fine to work in Word, PowerPoint, Google Docs, etc. instead
If you do, when you are pasting R output into your file, please use a monospaced font (Courier New, Consolas, etc.) so that everything lines up correctly
If you hate the idea of coding in R (understandable), gretl allows you to do many (but not all) of the same things using a graphical interface
Gretl is free software that works on all major platforms
You can download it from the gretl website
You can import data using File > Open Data > User File, then selecting the file type (csv, Excel, …). Gretl might ask you some questions about the format of the data after you do this
Note: You can use File > Open Data > Sample File to see some sample datasets that might be helpful for your empirical projects
You can get basic descriptive statistics by clicking on View > Summary statistics
gretl can do more advanced summary statistics, but it requires coding that isn’t any easier than in R
You can plot a histogram by going to Variable > Frequency distribution
You can make a scatter plot by selecting View > Graph specified vars > X-Y scatter
You can add transformations of variables using the Add menu (the Define new variable option lets you use arbitrary expressions, like z = mrdrte + exec for the sum of the murder and execution rates)
You can run a regression by going to Model > Ordinary Least Squares, selecting the dependent and independent variables, and clicking OK
You can paste the output into Word, PowerPoint, etc.
From the output window, you can also do things like modify the model, run additional tests (that we will discuss later in the semester) or plot the predictions from the model
You can go to File > Save to session as icon to save the results for future reference
You can save your entire gretl session by going to File > Session files > Save session
Here is what a gretl session looks like: